Steps for preprocessing sequencing data to generate a cell x gene expression matrix:

1. Run the sort_gzfastq function in bacteria_rnaseq.py to sort the paired-end sequencing reads into two files: one containing the biological cDNA sequence data,
and the other containing the cell barcode and UMI metadata.

2. Run the trim_bio_cdna function in bacteria_rnaseq.py to extract the cDNA sequence from each read and store it in FASTQ format, with the corresponding metadata sequence
as the read name.

3. Run the trim_meta function in bacteria_rnaseq.py to collect all reads with a valid cell barcode sequence into one FASTQ file, with the cell barcode and UMI forming
the read name.

4. Run the valid_bc_sep_file function in bacteria_rnaseq.py to split the output FASTQ file into multiple FASTQ files, one per cell barcode,
and remove reads that have the same sequence, UMI and cell barcode.

5. Run count_reads_distribution to find the mapped UMI threshold by generating the knee plot of mapped UMIs vs cell barcodes.
Also count the total reads split across the different cell barcode files.

6. Run filter_cutoff_cells.py to filter the FASTQ files (one per cell barcode) whose mapped UMI counts are above the UMI threshold into a separate new folder.

7. Run deduplicate_for_align.py to remove reads with duplicate UMIs and read labels from each of the FASTQ files before running the alignment.

8. Run the command: rsem-prepare-reference --bowtie2 --gtf Bacterial_strain.gtf Bacterial_strain_fasta.fa prefix_index_files
to prepare the transcriptome index of the bacterial strain for the downstream bowtie2-RSEM alignment.

9. Run the RSEM-bowtie2 alignment script alignment_crelox.sh on each of the cell barcode files cleaned up in Step 7.

10. Run the rsem_df function in bacteria_rnaseq.py to build the final cell x gene matrix from the output of the alignment.
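
Example for Step 4 (illustration only): the sketch below shows one way to drop reads that share the same sequence, UMI and cell barcode. It is not the code in bacteria_rnaseq.py. The names read_fastq and deduplicate are hypothetical, and it assumes uncompressed four-line FASTQ records whose read name carries the cell barcode and UMI (for example "@BARCODE_UMI", an assumed layout).

```python
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        block = list(islice(handle, 4))
        if len(block) < 4:
            return  # end of file (or a truncated final record)
        name, seq, _plus, qual = (line.rstrip("\n") for line in block)
        yield name[1:], seq, qual  # drop the leading "@" of the read name


def deduplicate(records):
    """Keep the first read for each (read name, sequence) pair.

    Assumption: the read name encodes cell barcode + UMI, so an identical
    name and sequence means an identical barcode, UMI and sequence.
    """
    seen = set()
    for name, seq, qual in records:
        key = (name, seq)
        if key not in seen:
            seen.add(key)
            yield name, seq, qual
```

Usage: with an open FASTQ file handle fq, list(deduplicate(read_fastq(fq))) returns the unique reads in their original order.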

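Example for Step 5 (illustration only): a knee plot ranks cell barcodes by their mapped UMI count and looks for the point where the curve drops sharply. The sketch below uses one common heuristic, the point farthest from the straight line joining the two ends of the ranked curve in log-log space. This is an assumption for illustration and may differ from how count_reads_distribution chooses the threshold. The name knee_threshold is hypothetical.

```python
import math


def knee_threshold(umi_counts):
    """Return the UMI count at the knee of the ranked barcode curve."""
    # Rank barcodes from highest to lowest UMI count, ignoring empty ones.
    ranked = sorted((c for c in umi_counts if c > 0), reverse=True)
    if len(ranked) < 3:
        raise ValueError("need at least three barcodes with nonzero counts")

    # Work in log-log space, as in a typical knee plot.
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(count) for count in ranked]
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    norm = math.hypot(x1 - x0, y1 - y0)

    # Perpendicular distance of each point from the chord joining the endpoints.
    distances = [
        abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0) / norm
        for x, y in zip(xs, ys)
    ]
    knee_index = max(range(len(ranked)), key=distances.__getitem__)
    return ranked[knee_index]
```

Usage: pass the list of mapped UMI counts (one per cell barcode file). Barcodes with counts at or above the returned value are candidates to keep in Step 6. The result should still be checked by eye against the plotted curve.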

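Example for Step 10 (illustration only): RSEM writes one tab-separated "<sample_name>.genes.results" file per run, with columns that include gene_id and expected_count. The sketch below assembles such files into a cell x gene matrix using only the standard library. It is not the rsem_df function itself. The name build_cell_gene_matrix is hypothetical, and it assumes that each RSEM sample name is the cell barcode.

```python
import csv
from pathlib import Path


def build_cell_gene_matrix(results_dir, suffix=".genes.results",
                           column="expected_count"):
    """Return (barcodes, genes, matrix); matrix[i][j] is barcode i, gene j."""
    per_cell = {}
    for path in sorted(Path(results_dir).glob(f"*{suffix}")):
        # Assumption: the file name prefix is the cell barcode.
        barcode = path.name[: -len(suffix)]
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            per_cell[barcode] = {
                row["gene_id"]: float(row[column]) for row in reader
            }

    barcodes = sorted(per_cell)
    genes = sorted({g for counts in per_cell.values() for g in counts})
    # Genes missing from a cell's file are filled with 0.0.
    matrix = [[per_cell[bc].get(g, 0.0) for g in genes] for bc in barcodes]
    return barcodes, genes, matrix
```

Usage: barcodes, genes, matrix = build_cell_gene_matrix("path/to/rsem_output") gives the row labels, the column labels and the values. Pass column="TPM" to collect TPM values instead of expected counts.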

